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Abstract 



DNA microarrays are a relatively new technology that can simultaneously mea- 
sure the expression level of thousands of genes. They have become an im- 
portant tool for a wide variety of biological experiments. One of the most com- 
mon goals of DNA microarray experiments is to identify genes associated with 
biological processes of interest. Conventional statistical tests often produce 
poor results when applied to microarray data due to small sample sizes, noisy 
data, and correlation among the expression levels of the genes. Thus, novel 
statistical methods are needed to identify significant genes in DNA microarray 
experiments. This article discusses the challenges inherent in DNA microarray 
analysis and describes a series of statistical techniques that can be used to 



overcome these challenges. The problem of multiple hypothesis testing and its 
relation to microarray studies is also considered, along with several possible 
solutions. 



High-dimensional biological data sets have become increasingly common in 
recent years. Examples include data collected from DNA microarrays, com- 
parative genome hybridization experiments, mass spectrometry, genome-wide 
association studies, and DNA/RNA sequencing. These new technologies have 
revolutionized our understanding of the genetics of human disease and numer- 
ous other biological processes. However, statistical analysis of such data sets 
is challenging for several reasons. These data sets are high-dimensional, and 
the sample sizes are often small. Moreover, many of these data sets tend to 
be "noisy," and the correlation between the features that are measured can be 
complex. For these reasons conventional statistical methods often produce un- 
satisfactory results when applied to modern high-dimensional biological data. 

The present study focuses on one of the most common problems in the analy- 
sis of high-dimensional biological data, which is the identification of significant 
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genes in DNA microarray studies. This is one of the best-studied problems in 
the analysis of high-dimensional biological data sets, and many of the methods 
that are applied to this problem may also be applied to other types of high- 
dimensional biological data. In a typical microarray study, one may wish to 
identify genes that are associated with a disease or some other biological pro- 
cess of interest. For example, one might attempt to identify genes associated 
with a disease by collecting a set of biological samples from diseased patients 
and another set of samples from healthy patients. Genes whose expression 
levels differ between the diseased samples and the control samples may be 
associated with the disease of interest. Alternatively, one might wish to identify 
genes that may be used to predict the prognosis of patients with a specific type 
of cancer. One might identify such genes by collecting tumor samples from 
a cohort of cancer patients and searching for genes whose expression levels 
are associated with the survival times of the patients. Ultimately, this informa- 
tion may be used for personalized treatment of cancer and other diseases. If 
the gene expression profile of a tumor indicates that the risk of metastasis is 
high, then the cancer should be treated more aggressively than another tumor 
whose gene expression suggests a low risk of metastasis. 

This article consists of three main sections. In the first section, we will briefly 
describe DNA microarray technology and how DNA microarray data is col- 
lected. In the second section, we will provide a brief overview of some of the 
methods that have been used to identify significant genes in DNA microarray 
experiments. Numerous methods have been proposed in recent years, and 
space does not permit a detailed discussion of all possible methods. We have 
attempted to focus on several of the most commonly used approaches, along 
with an overview of some of the common principles and techniques used in 
these methods. We also briefly describe a few more recent methods for com- 
bining information across genes. In the final section, we discuss the problem of 
multiple hypothesis testing, which inevitably arises when identifying significant 
features in high-dimensional data sets. 



DNA Microarray Data 

Overview of Molecular Biology 

Each organism's genetic information is contained in a molecule called deoxyribonu- 
cleic acid, more commonly known as DNA. DNA is a double-stranded molecule that 
is a chain of four possible nucleotides, namely adenine (A), cytosine (C), guanine (G), 
and thymine (T). The two strands of DNA are joined to one another by hydrogen bonds 
between nucleotides on the opposite strands. A always pairs with T, and G always pairs 
with C. Thus, if the sequence of one strand of DNA is known, then the sequence of the 
other stand is also known. Each such pair of bonded nucleotides is known as a base 
pair. 



There are approximately 3.2 billion base pairs in the human genome (i.e. the entire 
sequence of DNA in a given human cell)-i. Different segments of DNA perform differ- 
ent functions, and much of the DNA performs no known function. The DNA segments 
of primary interest in most studies are the segments which contain instructions for 
building proteins. These segments are known as genes, and they comprise about 1.5% 
of the DNA sequence in humans^. Proteins perform most of the important functions 
in cells, including metabolism, DNA replication and repair, and communication with 
other cells. 

The information contained in DNA is converted to proteins in a two-step process: In 
the first step, known as transcription, a given sequence of DNA is transcribed into an 
intermediary called messenger ribonucleic acid (mRNA), which is a single-stranded 
molecule that contains a copy of the complements of the base sequence of the DNA. 
The one difference is that thymine is replaced by uracil (U). In the second step, known 
as translation, the sequence of base pairs in the mRNA is translated into a protein, 
which is composed of a sequence of amino acids. Each set of three base pairs in the 
mRNA corresponds to one of 20 amino acids, a relationship that is known as the ge- 
netic code. This process by which the information in the sequence of DNA is converted 
to mRNA and then to proteins is known as the fundamental dogma of molecular biol- 
ogy. See Dudoit et al.— for a discussion of this process and its relationship to DNA 
microarray data. 



DNA Microarray Technology 

An important implication of the fundamental dogma of molecular biology is that there 
should be a strong association between the presence of a given protein in a cell and 
the presence of the mRNA sequence that is transcribed to build that protein. If a pro- 
tein is active in a given cell, there should be a large number of copies of the mRNA 
sequence corresponding to that protein. Conversely, if a protein is not active in a cell, 
there should be few copies of the corresponding mRNA sequence. Thus, DNA mi- 
croarray s attempt to evaluate the presence or absence of proteins in a cell and their 
relative abundance by measuring the relative abundance of the corresponding mRNA 
sequences. 

DNA microarrays measure the relative abundance of mRNA sequences in the cells in 
a sample by taking advantage of complementary base pairing. Recall that in a DNA 
(or RNA) sequence, C always pairs with G and A always pairs with T (or U). A DNA 
microarray is typically constructed by placing an array of probes on a glass micro- 
scope slide. Each probe consists of a sequence of nucleotides that is complementary 
to the nucleotide sequence of a specific mRNA or its corresponding DNA sequence. 
Thus, one can measure the expression level of a given gene by measuring the amount 
of mRNA that hybridizes to the spot on the microarray corresponding to the gene. 
Different forms of DNA microarrays exist, such as oligonucleotide microarrays— and 
cDNA microarrays^, but all of the most commonly used microarrays are based on this 
principle. 



[Figure 1 about here.] 

Figure Q] illustrates a typical (cDNA) microarray experiment. Two samples are col- 
lected, namely an experimental sample and a control sample. For example, the ex- 
perimental sample may contain tissue from a cancerous tumor, and the control sample 
may contain non-cancerous tissue from the same location in the body. First, mRNA is 
extracted from both samples. The extracted mRNA is treated with an enzyme called 
reverse transcriptase to convert it to a complementary DNA (or cDNA) sequence. Then 
each sample is treated with a fluorescent dye. Typically the red dye Cy5 and the green 
dye Cy3 are used. Equal amounts of the two samples are then hybridized onto an array. 
To determine which genes are expressed at a high (or a low) level in the experimental 
group compared to the control group, one may measure the ratio of Cy5 to Cy3 at the 
probe on the array corresponding to that gene. For example, suppose that the experi- 
mental sample was treated with the red dye and the control sample was treated with the 
green dye. Then a red spot on the array indicates that there was more mRNA for that 
particular gene produced by the experimental group than the control group, indicating 
that this gene is expressed at a higher level in the control group. An image of a DNA 
microarray slide is shown in Figure |2] 

[Figure 2 about here.] 

Before microarray data is analyzed, the ratio of red dye to green dye at each spot on 
the array is measured using an appropriate scanner. This ratio is stored in a large data 
matrix. Typically each row of the data matrix contains all the measured expression 
levels for a given gene, and each column of the data matrix corresponds to a particular 
sample. Such a data set is often visualized in the form of a heat map, as shown in 
Figure[3] The task of the microarray data analyst is to answer the biological question(s) 
of interest using this data matrix. 

[Figure 3 about here.] 

It is important to attempt to remove extraneous variation in microarray data prior to data 
analysis. Variations in design of the arrays, sample preparation and scanner reading can 
produce "batch effects" where some subset of samples exhibit systematic differences 
in gene expression that are unrelated to the biological process of interest. Failure to 
account for such batch effects can result in spurious findings.^ Thus, normalization 
is often necessary to remove batch effects. Numerous methods have been proposed 
to normalize microarray data— ~— . A detailed description of these normalization meth- 
ods are beyond the scope of this review; see the aforementioned references for more 
information. 



Methods for Identifying Significant Features in DNA Mi- 
croarray Data 

Perhaps the most common objective of microarray experiments is to identify genes that 
are associated with a biological process of interest. For example, one may wish to iden- 
tify genes associated with a disease of interest by comparing the expression levels of 
genes in diseased samples to the corresponding expression levels in healthy samples. 
Other outcomes of interest are also possible. For example, in cancer studies, one fre- 
quently wishes to identify genes associated with the survival time of cancer patients. 
The motivation is that genes that are associated with lower survival are likely to be 
associated with more serious forms of cancer that require more aggressive treatment. 

In statistical terms, one has a large number of features (genes) and an outcome variable 
(e.g. disease versus control, survival time, etc.). The objective is to identify genes that 
are associated with the outcome variable. In principle, this objective can be accom- 
plished using conventional statistical methods. To compare the expression of genes 
between two groups, one may calculate a t-test statistic for each gene. If there are 
three or more groups, an ANOVA F-test statistic may be used. To find genes associated 
with a continuous outcome variable, one may calculate a standardized regression co- 
efficient, and to find genes associated with a survival outcome, one may calculate the 
Cox score for each genei&i^. 

However, these conventional methods often perform poorly on microarray data sets for 
several reasons, which will be discussed in more detail below. DNA microarray data 
sets are frequently noisy, and sample sizes are often small. Moreover, the gene ex- 
pression levels are often highly correlated with one another, and failing to account for 
this fact may result in a loss of power. Also, one will typically perform several thou- 
sand hypothesis tests in a microarray experiment, so specialized methods are needed to 
control for type I error. 

Throughout the remainder of this section, we will assume that one is comparing two 
different conditions using a i-test or a variation of the conventional t-test However, the 
methods discussed below are easily generalized to other test statistics, such as ANOVA 
F-tests, standardized regression coefficients, and Cox scores. 

Fold Change Methods 

One simple method for identifying differentially expressed features is to compute the 
average value of each feature under each condition and then compute the ratio of these 
averages. If the ratio exceeds some arbitrary cutoff, then the difference is called "sig- 
nificant." For example, a gene may be called "significant" if the average expression 
level of a gene is more than twice as large (or less than half as large) in one condition 
compared to the other. 

This approach has the benefit of simplicity, and it has been used in previous microarray 
studies-22i2!. However, this method has some serious shortcomings. It is not based on 



a formal statistical test, so there is no simple way to calculate a p-value or confidence 
interval or other measure of the statistical validity of the association. Moreover, it is 
easy to see that this fold change has higher variance for genes expressed at lower levels, 
which is true of the majority of genes in microarray studies— ^2i. For these reasons fold 
change methods are generally accepted to be inferior to other methods for identifying 
differentially expressed features— ~—. 



T-Tests 

An alternative approach is to identify significant genes based on a two-sample i-test 
of the null hypothesis that the mean expression level of the gene is the same under 
both conditions. This approach has also been used in microarray studies^, and it has 
several advantages over fold change methods. It is straightforward to calculate p-values 
and confidence intervals using t-tests, and for large samples the distribution of the t- 
statistic is independent of the overall expression level of the gene. In contrast, fold 
change statistics have higher variance for genes expressed at low levels. 

Unfortunately, using t-tests to identify differentially expressed genes can be problem- 
atic when the sample size is small, which is commonly the case in microarray exper- 
iments. It can be difficult to obtain accurate estimates of the variances of each group 
when the sample sizes are small. In particular, if the the estimated variance of a gene is 
small, which occurs frequently when a gene is expressed at low levels^, then the gene 
may have a large ^-statistic even if the fold change is small. 



Alternative Versions of T-Tests 

Given the shortcomings of i-tests described above, numerous authors have proposed 
alternative versions of i-tests for identifying significant features in gene expression 
data. Typically these methods combine data from all the genes to obtain a regularized 
estimator of the variance of a particular gene. In general, such variance estimates are 
biased. However, since the usual estimator of the variance has high variance when the 
sample size is low, a biased estimator of the variance may have lower prediction error 
than an unbiased estimator, since these biased estimators have lower variance than the 
unbiased estimator. This is especially true when the sample size is small. See Hastie 
et al.— for a more detailed discussion of this phenomenon, which is commonly referred 
to as the "bias-variance trade-off." 

An example of the bias-variance trade-off is shown in Figure |4] Suppose the objective 
is to predict y based on x given the data in the figure. If one predicts y using a linear 
regression estimator based on x, the variance will be relatively low, but the bias will be 
high, since it cannot model the nonlinear relationship between x and y. At the other 
extreme, if one predicts y by interpolating the data with a smoothing spline, the bias 
will be 0, since the interpolation function can model any arbitrary relationship between 
x and y. However, the variance will be high, since the predicted values of y may change 
drastically if such a model is fit to a new data set. 



[Figure 4 about here.] 

Figure|5]shows how the bias and variance of a series of models varies as the complexity 
of the model increases. Each model in this figure represents a smoothing spline^ fit 
to the data from Figure [4] As the complexity of the model increases, the variance of 
the model increases and the bias of the model decreases. One attempts to choose the 
model complexity that minimizes the expected prediction error or mean squared error 
(MSE), which can be shown to be equal to the sum of the variance, the square of the 
bias, and an irreducible error term due to unexplainable variance in y^L. 

[Figure 5 about here.] 

These figures illustrate why a regularized (i.e. biased) estimator of the variance of the 
genes may produce better results when identifying significant features based on mi- 
croarray data. By regularizing the estimates of the variance, the complexity of each 
individual model is reduced, increasing the bias of the model but decreasing the vari- 
ance. If the decrease in variance is sufficient to offset the increase in bias, the accuracy 
of the overall model may be increased. 

One possible approach is to estimate the variance of each gene by using the pooled 
estimator of the variance of all genes. Although this method has been used for several 
microarray studies—^, it also has some serious shortcomings. This obviously assumes 
that the variance of the expression levels of all genes are approximately the same, which 
is unlikely to be true in most situations. More importantly, since the denominator of 
the i-test will be the same for all genes, this method is essentially equivalent to the fold 
change method, since it selects the genes with the largest mean differences without 
regard for the variance of an individual gene and thus suffers from the same drawbacks 
as fold change methods. In terms of the bias-variance trade-off, this pooled variance 
estimate has low variance but high bias. 

An alternative approach is to combine the variance estimator of each gene with some 
sort of pooled estimator of the variance across the genes. This avoids the high variance 
that results from estimating the variance of each gene individually as well as the high 
bias that results from relying entirely on a pooled variance estimate. For example, the 
"Significance Analysis of Microarrays" (SAM) procedure of Tusher et al.— uses the 
following test statistic: 

U = %^ (1) 

Si + So 

Here ti represents the i-statistic for the ith gene, and Xi and Yi represent the mean ex- 
pression level of the gene under each experimental condition. The variance is estimated 
by summing the estimated variance of the zth gene (denoted by s,) and a normalizing 
constant sq. This normalizing constant reduces the variance of the estimator of the vari- 
ance and hence reduces the likelihood of obtaining false positive findings as a result of 
genes whose estimated variance is small. Typically sq is chosen to be some quantile 
(such as the median) of the s^'s across all of the genes. The SAM software is publicly 



available as an add-in for Microsoft Excel (http://www-stat.stanford.edu/tibs/SAM/). 
It is also implemented in the "samr" R package. 

Other normalized estimators of the variance of microarray s amples have been proposed. 
For example Huber et al.— apply an arsinh transformation to the gene expression data 
that is designed to produce stable variance estimates irrespective of the gene's over- 
all expression level. This method is implemented in the "vsn" R package (available 
through the Bioconductor project at http://www.bioconductor.org). Cui et al.— com- 
bine gene-specific and between-gene variance estimates using a James-Stein estima- 
tor—. R code for implementing this method is available at 

http://www.stjuderesearch.Org/depts/biostats/documents/cui-Fstat.R. In general, any method 
to reduce the variance in the estimates of the variances of the individual genes can pro- 
duce more accurate results when the sample size is small. 



Bayesian Methods 

Bayesian methods can also be used to combine information across genes to avoid in- 
accurate variance estimates as a result of small sample sizes. Typically these methods 
impose some type of Bayesian prior distribution on the gene expression data and es- 
timate the posterior distribution for each gene by combining information across all of 
the genes. For example, Baldi and Long— impose a prior distribution on the variances 
of the genes to obtain the following regularized t-test: 

t . = X *- Y * (2 ) 

/ t)QCrg + (w-l)sf 

V vo+n-2 

In this expression Xj, Y~i, and Si are defined as they were in ([!}. The parameter <7q is 
an estimator of the pooled variance across genes, which is calculated using data from 
all the genes, and vq is a tuning parameter that controls the relative contributions of 
the gene-specific variance estimate and the global variance estimate. R code for imple- 
menting this method is available at http://molgen51.biol.rug.nl/cybert/help/index.html. 

Note that ((2j is similar to ([TJ in that the denominator of the t-statistic consists of a lin- 
ear combination of an estimator of the variance of gene i plus a pooled estimate of the 
variance of all the genes. The similarity between the two expressions is not surprising. 
In general Bayesian methods tend to produce biased parameter estimates, but these es- 
timators may have lower variance/mean squared error than unbiased estimators, which 
is the same motivation for considering the regularized variance estimators discussed 
previously. Indeed, in some situations regularized frequentist parameter estimators can 
be shown to be Bayesian estimators with the appropriate choice of prior 40 . 

Other similar Bayesian approaches have also been proposed for different types of mi- 
croarray problems—^. In particular, the "limma" method of Smyth— uses an empir- 
ical Bayes test statistic that consistently performed well in a recent study comparing 
feature selection methods for microarray data—. 



Calculating P- Values 

If a i-test (or other conventional parametric test, such as ANOVA or regression) is used 
to test the null hypothesis of no association between the expression level of a given gene 
and an outcome, then calculating the p-value for this null hypothesis is straightforward 
if the assumptions of the test are satisfied. However, it may be dangerous to assume that 
these test statistics are normally distributed when the sample size is small. Moreover, 
as discussed previously, in many situations it is preferable to use biased estimators of 
the variance of a gene's expression level. When a biased estimator of the variance is 
used, a i-statistic may no longer have a t distribution. Thus, alternative approaches 
may be needed to compute p-values in these situations. 

One possible alternative is to calculate p-values based on the permutation distribution 
of the test statistic. Let tj denote the i-statistic (or other test statistic) associated with 
gene j. Suppose the sample labels are then permuted K times, and let tj t k denote 
the test statistic associated with gene j for the fcth permuted data set. Then one can 
estimate the p- value for gene j (denoted by pj) as follows: 

1 K 

k=l 

Here I(x) denotes an indicator function that is equal to 1 if the condition is true and 
otherwise. In other words, the p-value is estimated by counting the number of times 
that the permuted version of the test statistic is "more extreme" than the original (unper- 
muted) version of the test statistic. A very large (or very small) test statistic is unlikely 
to occur by chance, so very few permuted data sets will produce a larger test statistic 
and the p-value will be small. This approach is used by the "SAM" software package 24 
to calculate p-values. 

This approach requires a choice of the number of permutations K. For small data sets, 
one may simply evaluate all possible permutations. In the case where one wishes to 
compare ?ii samples from one condition to n-i samples from another condition using 
a t-test (or variant thereof), there are a total of ( ni ^ n2 ) possible permutations. How- 
ever, this would be computationally intractable for larger data sets, so it is common to 
arbitrarily select a value of K — 1000 or an even larger number if more precision is 
desired. 

One possible problem with calculating p-values using OJ is that it can be difficult to 
estimate p-values that are close to 0. If \tj\ > \tjk\ for all k, then (f3]l implies that 
Pj = 0, which in reality all that can be inferred is that pj < l/K. This is problematic 
because certain methods for adjusting for multiple hypothesis testing in microarray 
experiments require precise estimation of p-values that are very close to 0. See below 
for more details. 

There are a few possible solutions to this problem. The simplest approach is to increase 
the value of K. This will solve the problem given sufficient computing power, but it 
can be computationally intractable for large data sets. Another possibility is to pool the 



results of all the genes when calculating the permutation p-values. Suppose there are a 
total of N genes in the experiment. Then we estimate pj as follows: 



N K 

^ = ]vFEE 7 (i M> fei) (4) 

»=i /t=i 

In other words, rather than simply counting the number of times that the permuted test 
statistic for gene j is more extreme than the unpermuted test statistic for gene j, one 
counts the number of times that the permuted test statistic for any gene is greater than 
the permuted test statistic for gene j. This can increase the precision of the estimates 
of Pj without increasing the computational burden. See Hastie et al.~ for a complete 
discussion of calculating p-values based on the permutation distribution of the test 
statistics. 



Methods for Combining Information Across Genes 

The methods discussed thus far assume that hypothesis tests will be performed on each 
gene one at a time, and that the results of a hypothesis test on a given gene will not be 
affected by the hypothesis tests performed on other genes. This strategy may be inef- 
ficient on DNA microarray studies. Genes often act in pathways, meaning that several 
genes may be involved with the same biological process and hence be activated and 
deactivated simultaneously. If several related genes show evidence of differential ex- 
pression at the same time, that is stronger evidence that the differential expression rep- 
resents biological signal than if such a pattern were observed for a single gene. Several 
methods have been proposed for combining information across genes when searching 
for differentially expressed genes in microarray studies, which will be discussed below. 



Biologically Motivated Methods 

One approach for combining information across genes is to utilize known biological 
relationships among the genes. Typically genes are classified into groups using biolog- 
ical databases such as Gene Ontology—. Each group represents a set of biologically 
similar genes. The most commonly used methods compare the number of significant 
features in each group to the number expected if the genes in the group are not differ- 
entially expressed. If there are an unusually high number of significant features in a 
given group, that suggests that the pathway corresponding to the group is differentially 
expressed. 

One strategy for identifying pathways containing differentially expressed genes is known 
as over-representation analysis (ORA). ORA first identifies a list of "significant" genes 
using any of the previously described methods for detecting differentially expressed 
genes. The M "most significant" genes are selected, which are typically the genes 
with the smallest p-values. Then for each group of genes, Fisher's exact test (or some 
approximation thereof) is used to test the null hypothesis that the number of genes 
called significant in each group does not exceed the number of genes expected to be 
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called significant due to chance. Various implementations of ORA have been proposed 
in the literature^^, and it has been used in some microarray experiments—. 

Despite the popularity of ORA, it has several shortcomings. Typically only the top M 
genes are used to compute the ORA statistics, resulting in the loss of any information 
available from genes not among the M most significant genes. The choice of M is 
often arbitrary as well. Moreover, all of the top M genes are treated equally, meaning 
that genes with extremely small univariate p-values are given the same weight as genes 
whose univariate p-values are much larger. Finally, ORA considers the gene to be the 
unit of analysis rather than the subject, which is inappropriate in virtually all real-world 
situations. Among other issues, it implies that the gene sets should be independent of 
one another, which is almost certainly not true in practice. See Pavlidis et al. 55 , Tian 
et al.— , or Allison et al.— for more information on the shortcomings of ORA. 

An alternative strategy that avoids the problems associated with ORA is gene set 
enrichment analysis (GSEA). GSEA functions as follows: First, the genes are or- 
dered according to their t-statistics or p-values or some other measure of univari- 
ate statistical significance. Then for each group of genes, the distribution of the t- 
statistics of the genes in the group is compared to the distribution of the genes not 
in the group using a one-sided Kolmogorov-Smirnov statistic or some other similar 
statistic. The idea is that if a group of genes is differentially expressed, then the 
distribution of the t-statistics among that set of genes should be different than the 
distribution of the t-statistics among the remaining genes. A p-value can be calcu- 
lated for each set of genes by permuting the data multiple times and using (0) or 
or alternative methods. Various implementations of GSEA have been proposed 
in the literature- 5 ^—. There are also several software implementations of GSEA. For 
example, the Broad Institute offers software to perform GSEA in both Java and R 
(http://www.broadinstitute.org/gsea/index.jsp) and a variant of GSEA is implemented 
in the "SAM" software package. 

The main shortcoming of GSEA is the fact that it tests a "competitive null hypothesis." 
Suppose we have two groups of genes, which we will call gene group 1 and gene group 
2. Then a smaller p-value for testing the null hypothesis of no differential expression 
in gene group 1 implies a larger p-value for testing this null hypothesis in gene group 
2 even if the expression levels in gene group 2 remain unchanged. This occurs because 
the p-value for gene group 2 is calculated by comparing the test statistics of the genes 
in gene group 2 to the test statistics of all genes not in gene group 2, including the test 
statistics in gene group 1. Thus, if extreme test statistics are observed in gene group 1, 
this decreases the significance of gene group 2. See Damian and Gorfine— or Allison 
et al.— for a more detailed discussion of this phenomena. The development of methods 
for identifying groups of genes associated with an outcome of interest that avoids the 
shortcomings of ORA and GSEA is an active research area. 



Statistically Based Methods 

Other strategies for combining information across genes use novel statistical methods 
that do not require any knowledge of the biological relationship between the genes. We 
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have previously discussed one possible statistical strategy for combining information 
across genes, namely regularized or Bayesian estimators of the variance of individual 
genes. By using information about the variance of other genes to estimate the variance 
of a specific gene, the variance of the test statistic is greatly decreased, and hence the 
risk of false positives and false negatives is also decreased. However, in recent years 
there several more advanced methods have been proposed for combining information 
across genes which we will briefly describe below. 

One strategy is known as the optimal discovery procedure (ODP) 66 i 67 . The motivation 
for ODP is similar to the motivation for the pathway-based methods discussed previ- 
ously. Since genes function in pathways, we expect that genes in the same pathway are 
likely to be co-expressed. Thus, if a gene shows evidence of differential expression, 
one can be more confident that the differential expression is not due to chance if other 
genes show a similar expression pattern. See Figure |6]for an illustration of this idea. 
The difference between ODP and pathway -based methods is that pathway -based meth- 
ods require one to know in advance which genes are expected to be co-expressed based 
on previously collected biological data whereas ODP does not. 

[Figure 6 about here.] 

The ODP is a generalization of the Neyman-Pearson lemma—. The Neyman-Pearson 
lemma states that the most powerful test of a given null hypothesis against a given 
alternative hypothesis rejects the null hypothesis when the ratio 

probability of the observed data under the alternative hypothesis 
probability of the observed data under the null hypothesis 

is large. The ODP generalizes the Neyman-Pearson lemma to situations where multiple 
hypotheses are tested by rejecting the null hypothesis that gene i is not differentially 
expressed when the ratio 

sum of the probabilities of observing data i under each alternative hypothesis 
sum of the probabilities of observing data i under each null hypothesis 

is large. Thus, if a set of genes with similar expression patterns all show evidence of 
differential expression, then (|6]l will be larger than (|5} for a given gene in the set, mean- 
ing that the null hypothesis of no differential expression is more likely to be rejected 
under ODP than under the traditional Neyman-Pearson paradigm. 

In practice, (|6]l cannot be computed exactly and must be approximated.^^ Software 
for computing the ODP is publicly available 69 . 

An alternative strategy for combining information across genes without any biological 
information about the relationship between the genes is the Lassoed Principal Compo- 
nents (LPC) method of Witten and Tibshirani ™ . The motivation for LPC is similar to 
the motivation for ODP: A gene is more likely to be differentially expressed if there 
are other genes with similar expression patterns than it is if there are no such similar 
genes. However, LPC uses a different strategy to determine if there are other genes 
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with similar expression patterns. The idea behind LPC is that if a group of genes are 
co-regulated, then it is likely that a principal component of the gene expression matrix 
(sometimes called an eigenarray— ) will capture the variance in this group of genes. 
Thus, the LPC algorithm attempts to identify an eigenarray or group of eigenarrays 
that are associated with the biological process of interest and projects the ^-statistics 
(or other relevant test statistics) onto this group of eigenarrays. This method can be 
shown to significantly reduce the false discovery rate in a variety of situations.— This 
method is implemented in the "lpc" R package. 



Clustering and Prediction Methods 

Identifying features associated with an outcome of interest is not the only objective of 
microarray studies. One may also wish to partition the data into homogeneous sub- 
groups and/or use the data to predict an outcome of interest. Clustering methods and 
prediction methods are useful in this situation. 

There is a vast literature devoted to methods for clustering or predicting an outcome 
based on microarray data. A full description of such methods is beyond this scope of 
this review (which focuses on feature selection). However, it is noteworthy that there 
are methods for clustering^^^^ 4 - and predictioni*^ 5 - - — that also perform feature selec- 
tion. These methods generally do not evaluate whether a selected gene is "statistically 
significant" nor do they indicate which genes are the "most significant." Also, the user 
of these methods often has limited control over the number of features selected. Thus, 
these methods have serious disadvantages if feature selection is the primary goal of 
the analysis. Nevertheless these methods can identify a list of genes for further study, 
particularly in cases where clustering and prediction are important goals of the experi- 
ment. 



Comparison of Feature Selection Methods 

Numerous methods have been proposed for identifying significant features in DNA 
microarray data. However, the question of which methods produce the best results 
(i.e. maximize power while controlling type I error) has not been studied extensively. 
In practice researchers often choose feature selection methods based on the ease of 
implementing the method rather than the performance of the method. The "SAM" 
software package has become a popular tool for microarray analysis largely due to 
the fact that it is available as an Excel add-in and does not require the use of R or 
command-line programs. 

Limited research indicates that the "limma" method of Smyth— performs well for 
a wide variety of problems, although other methods may perform better in specific 
situation s 46 ' 70 ' 95 . Limma is implemented in the "limma" R package, which is available 
from Bioconductor. Determining which feature selection method is likely to produce 
the best results on a given data set is an important area for future research. 
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Issues Related to Multiple Hypothesis Testing 

Identifying significant genes in microarray studies requires performing a large number 
of hypothesis tests, which presents statistical challenges. When performing a single hy- 
pothesis test, it is conventional to choose a significance level a such that the probability 
of rejecting the null hypothesis when it is true is equal to a. However, when multiple 
hypothesis tests are performed, the probability of at least one false positive test will be 
much larger than a. Thus, methods are needed to control the number of false positive 
tests while maintaining sufficient power to identify truly significant genes. 



The Family-Wise Error Rate 

One possible solution is to control the family-wise error rate (FWER) at a specified 
level. The FWER is defined to be the probability of rejecting at least one null hypoth- 
esis that is true. The most common way to control the FWER at a specified level is 
to use a Bonferroni correction: Each individual null hypothesis is rejected if and only 
p < a/N, where p is the p-value for the test and N is the total number of tests. It is 
easy to show that the probability of at least one type I error is no greater than a using 
this procedure. 

Although the Bonferroni correction controls the number of false positive tests, it is 
a very stringent criterion that typically results in a substantial loss of power. In ex- 
periments with small sample sizes it is common for no tests to satisfy the Bonferroni 
criteria. Thus, most microarray analysts prefer less stringent approaches^. Methods 
exist for controlling the FWER using more permissive criteria than the Bonferroni cor- 
rection 396 , but these methods also suffer from lower power and are not commonly 
used. 



The False Discovery Rate 

The false discovery rate (FDR) is defined to be the expected proportion of false posi- 
tives among the set of genes that are called significant. One may also adjust for multiple 
comparisons by controlling the FDR rather than the FWER. This approach typically 
yields greater power than FWER-based methods and hence is generally regarded as 
preferable—. 

The FDR was first proposed by Benjamini and Hochberg—. To control the FDR at 
a given level a, they proposed the following procedure: Let pm < p(o) < • • • < 
P(N) be the ordered p-values, and let -ff(i), ^(2)7 ■ ■ ■ ■> H(N) be the corresponding null 
hypotheses. Then reject H^ , H( 2 ) > • • • H(j)i where 

j = max{z : p (i) < ai/N} (7) 

% 

Benjamini and Hochberg— prove that the FDR of this procedure is at most a. This 
procedure is always valid if the p-values are independent. It remains valid in some cases 
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even when dependency exists among the p-values, and methods exist for estimating the 
FDR where any type of dependency exists.— ~— 

Rather than choosing a specific FDR in advance, one may wish to estimate the FDR 
when the top m genes are called significant. This is easy to do using the methodology 
of Benjamini and Hochberg— : If we let 

a=P( m )N/m (8) 

Then © implies that the FDR should be approximately a. 

If one estimates the null distribution of the test statistics using permutations of the data 
as in (O and ©, then an alternative estimator of the false discovery rate may be used. 
Once again, let tj denote the t-statistic (or other test statistic) associated with gene j, 
and let tjk denote the test statistic associated with gene j for the fcth permuted data set. 
Also, let t (i) < t(2) < • • • < tfN) be the order statistics of the absolute values of the 
tj's. Then one may estimate the FDR a when the top m genes are called significant as 
follows: 

N K 

^^EE^^v-)) (9) 

8=1 fc=l 

In other words, one estimates the FDR by dividing the average number of genes called 
significant over K permuted data sets by the number of genes called significant in the 
unpermuted data set (which is m, since the top m genes were called significant). It can 
be shown that a in (O is a consistent estimator of the FD R 99 ' 111 . Also, it can be shown 
that estimators © and (0 are equivalent—. There are several R packages which will 
compute the FDR using this methodology (such as the "multtest" package, which is 
available from CRAN, and the "fdrame" package, which is available from Bioconduc- 
tor). This methodology is also implemented in the "SAM" software package. 



The Q-Value 

In multiple testing problems, the q-value-i^ of a given test statistic t is defined to be the 
smallest possible FDR that can occur among all possible rejection regions that reject 
the null hypothesis when T = t. For example, if a t-statistic is calculated for each gene 
and the jth such t-statistic is tj and \tj\ — C, then the q-value for the jth hypothesis 
test is the FDR for the rejection region \ti\ > C. In other words, the q-value is the 
FDR that results when one calls gene j significant along with all other genes that have 
a more extreme test statistic than gene j. Obviously genes with more extreme test 
statistics will have smaller q-values. The q-value may be estimated using ([8]) or (O, 
although other approaches are possible (see below). The q-value may be calculated 
using the "qvalue" R package (available from Bioconductor) as well as the "SAM" 
software package. 

A Bayesian interpretation of the q-value is possible, as described in Storey — , Efron 
and Tibshirani— and Storey—. Suppose that each gene comes from one of two pop- 
ulations, one of which consists of genes that are differentially expressed, and the other 
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which consists of genes that are not differentially expressed. Under this assumption, 
the test statistic for each gene may be modeled using a mixture model. Define a set 
of random variables Zj such that Zj = if gene j is not differentially expressed and 
Zj = 1 if gene j is differentially expressed. Also let \tj\ = C, and let q(tj) be the 
q-value corresponding to gene j. Then one can sho w 111 ' 114 that 

q(tj) = P(Zj = 0\\tj\ > C) (10) 

In other words, under this mixture model, the q-value is the posterior probability that 
the jth null hypothesis is true given the test statistic for gene j. Although we have 
assumed a rejection region of the form |i, > C in ( [Tol l, this result holds under more 
general rejection regions. 

Note: ( fTUb is only true if we calculate the p- value based on the positive false discovery 
rate (pFDR) as defined by Storey— rather than the traditional FDR proposed by Ben- 
jamini and Hochberg— . The pFDR is defined to be the expected proportion of false 
positives among the set of genes that are called significant conditional on the fact that 
at least one gene is called significant. 

Methods exist for directly estimating the mixture distribution under the model de- 
scribed above and thereby estimating the pFDR/q-value corresponding to individual 
genes using ( fTOb or similar procedures 115-118 . Limited research suggests that many 
of these methods produce comparable results 29 ' 119 . Such mixture models may also be 
used for omnibus testing. 12( ¥ 21 

Conclusion 

Microarrays have been important for a variety of biological applications for over 
a decade. However, technology for generating high-throughput biological data 
is improving at a rapid pace, and technology that is commonly used today may 
be replaced in the near future. Indeed, some have suggested that newer tech- 
nologies such as RNA-seq may soon replace microarrays 122 just as microar- 
rays have largely replaced older techniques such as Northern blotting 12 £. As 
technology advances, new methods will be necessary for analyzing data sets 
generated by the new techniques, and some methods for analyzing microarray 
data may no longer be useful in the future if microarrays are replaced by newer 
methods. Indeed, methods for identifying differentially expressed genes based 
on RNA-seq data is currently an active research area.— - — 

Despite these changing technologies, we feel that a discussion of methods 
for analyzing microarray data is still relevant and timely. DNA microarrays are 
still cheaper than RNA-seq assays, and RNA-seq gene expression measure- 
ments can be unreliable for genes expressed at lower levels-^. More impor- 
tantly, however, many of the statistical techniques that have been developed 
for analyzing microarray data can also be applied to data produced by other 
high-throughput biological assays. For example, using normalized or Bayesian 
estimators of the variance of an estimator is useful for performing feature se- 
lection in any situation where the number of features is large and the number 
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of observations is small. Similarly, use of the FDR and pFDR to control type I 
error is useful for a wide variety of multiple testing problems, which arise in the 
analysis of nearly all types of modern high-throughput biological data. For ex- 
ample, the "SAM" software package for DNA microarray analysis was recently 
upgraded to analyze RNA-seq data in addition to DNA microarray data. The 
new method continues to use resampling-based approaches to estimate the 
null distribution of each test statistic which is then used to estimate the FDR. 
See Li and Tibshiranii^ for details. Likewise, GSEA and other pathway-based 
methods for feature selection have been applied to genome-wide association 
studies— - —. Thus, we see that the methods developed for DNA microarray 
analysis will be useful for many years in the future even as technology changes. 

Further Reading 

Cui and Churchill 134 and Allison etal. 29 provide good overviews of feature 
selection methods for microarray data. Jeffery et al. 46 describe several of the 
most commonly used feature selection methods for microarrays and compare 
the performance of these methods on 9 publicly available data sets. There are 
also numerous books containing information on feature selection and other 
aspects of microarray data analysis not considered in this review. Good refer- 
ences include Causton etal. 135 , Parmigiani etal.—, Speed—, Wit and Mc- 
Clure 138 , McLachlan et aL-12 9 ., Do et al.iiS, and Draghiciiil. 
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Figure 1: Illustration of a typical microarray experiment (using cDNA technology). 
First, mRNA is extracted from two groups of cells, namely an experimental sample of 
interest and a control sample. Each sample is labeled with a different color of fluores- 
cent dye. The samples are then combined and hybridized onto an array. The relative 
abundance of the mRNA corresponding to a particular gene can be measured by calcu- 
lating the ratio of red dye to green dye at the appropriate spot on the array. 
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Figure 2: Image of a DNA microarray slide. One may measure the relative gene ex- 
pression of each gene by comparing the ratio of the amount of red dye to the amount 
of green dye at each probe on the array. 
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Figure 3: Heat map of the leukemia microarray data of Bulling er et al.— . Each col- 
ored square on the map corresponds to the expression level of a given gene for a given 
patient. In the above figure, each row represents a gene and each column represents a 
patient. The brighter the color of a given square, the higher (or lower) the expression 
level of the corresponding gene. Usually hierarchical clustering is performed on the 
rows and columns of the data set prior to drawing the heat map. 
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Figure 4: Illustration of the bias-variance trade-off. The above figure shows a regres- 
sion problem where the objective is to predict y given a value of x. The dotted line 
shows the true relationship between x and y. The linear regression estimator (shown in 
blue) has high bias and low variance, and the interpolation estimator (shown in orange) 
has low bias and high variance. 
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Figure 5: Illustrates the association between the complexity of a model and the 
bias/variance of the model. In general, as the complexity of a model increases, the 
variance of the model increases and the bias of the model decreases. 
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Figure 6: Illustration of the ODP procedure. Suppose that the test statistic for the null 
hypothesis of no differential expression is t = —2 for one gene and t = 2 for a second 
gene. Suppose further that there are several other genes with similar expression patterns 
to the second gene for which t«2, Using traditional hypothesis testing procedures, 
one would be equally likely to reject the null hypothesis of no differential expression 
for both of the two genes. Using ODP, one would be more likely to reject the null 
hypothesis for the gene where t = 2, since the existence of several genes with similar 
expression patterns increases ones confidence that the result is not due to chance. 
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